Automatic Web Scraper using R and GitHub Actions

Categories: R, Web Scraping, GitHub Actions
Author: Xinzhuo Huang (HKUST SOSC)
Published: November 2, 2023
Modified: November 24, 2023

Let’s create an automatic web scraper using R and GitHub Actions to scrape real-time Weibo hot searches (Weibo’s equivalent of Twitter’s trending topics).

GitHub Actions workflow

Set up a GitHub Actions workflow with a scheduled trigger. The cron expression */30 * * * * runs the job at every 30th minute of every hour; note that GitHub may delay scheduled runs during periods of high load.

Code
on:
  schedule:
   - cron: '*/30 * * * *' 

jobs:
  update-report:
    runs-on: ubuntu-latest
    permissions:
      contents: write # allow the GITHUB_TOKEN to push commits back to the repo
    
    steps:
      - name: Set up R
        uses: r-lib/actions/setup-r@v2
        
      - name: Check out repository
        uses: actions/checkout@v3
      
      - uses: actions/cache@v3 # cache R packages so they aren't recompiled on every run
        with:
          path: ~/.local/share/renv
          key: ${{ runner.os }}-renv-${{ hashFiles('**/renv.lock') }}
          restore-keys: |
            ${{ runner.os }}-renv-

      - name: Install packages
        uses: r-lib/actions/setup-r-dependencies@v2
        with:
          packages: |
            any::tidyverse
            any::rio
            any::rvest
            any::httr2
            any::jsonlite
            any::pacman
            
      - name: Web Scraping
        run: source("weibo_realtime.R")
        shell: Rscript {0}
        
      - name: Commit files
        run: |
          git config --local user.name actions-user
          git config --local user.email "actions@github.com"
          git add data/*
          git commit -m "GH ACTION Headlines $(date)" || echo "No new data to commit"
          git push origin main
        env:
          REPO_KEY: ${{secrets.GITHUB_TOKEN}}
          username: github-actions
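
Scheduled workflows only fire once the file is on the default branch, which makes testing slow. While debugging, you can optionally add a workflow_dispatch trigger next to the schedule (not part of the original workflow) so runs can also be started by hand:

Code
on:
  schedule:
   - cron: '*/30 * * * *'
  workflow_dispatch: # adds a manual "Run workflow" button in the Actions tab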

Next comes the scraper script itself, weibo_realtime.R. Once it is pushed to GitHub alongside the workflow above, the Action will scrape the Weibo hot-search list automatically.

Code
library("pacman")
p_load(tidyverse, httr2, rvest)

# Two layers of protection around the request:
#  - insistently() retries the whole call with exponential backoff
#  - possibly() returns "error!" instead of aborting if all retries fail
get_hot_item <- possibly(
  insistently(
    \() {
      # "sample_website" is a placeholder; substitute the Weibo hot-search URL
      result <- "sample_website" %>%
        request() %>%
        req_timeout(1000) %>%
        req_headers(
          "User-Agent" = "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
          "Cookie" = "SUB=_2AkMXxWiSf8NxqwFRmPoWz2nlbop1zwvEieKhmZlJJRMxHRl-yT9jqlAItRB6PEVGfTP09XmsX_7CR2H1OUv6b-f-1bJl;SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWENAjmKyIZz1AWjDi68mRw"
        ) %>%
        req_retry(
          max_tries = 5,
          max_seconds = 60,
          backoff = ~ 2
        ) %>%
        req_perform() %>%
        pluck("body") %>%
        read_html() %>%
        html_element("tbody") %>%
        html_table()
      
      return(result)
    },
    rate = rate_backoff(
      pause_base = 3,
      pause_cap = 60,
      max_times = 3,
      jitter = TRUE
    ),
    quiet = FALSE
  ),
  otherwise = "error!"
)

hot_items <- get_hot_item()

# The workflow commits everything under data/, so the scrape must be saved
# there; the timestamped file name below is one possible convention.
if (is.data.frame(hot_items)) {
  dir.create("data", showWarnings = FALSE)
  write_csv(
    hot_items,
    file.path("data", paste0("weibo_", format(Sys.time(), "%Y%m%d_%H%M"), ".csv"))
  )
}
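
Each successful run now adds one timestamped CSV snapshot to data/. Here is a minimal sketch for stacking the accumulated snapshots into a single tibble for later analysis, assuming the file-naming convention used above:

Code
library(tidyverse)

# Read every snapshot in data/ and stack them, keeping the source file
# name in a "snapshot" column so rows can be traced to their scrape time.
hot_history <- list.files("data", pattern = "\\.csv$", full.names = TRUE) %>%
  set_names() %>%
  map(read_csv, show_col_types = FALSE) %>%
  list_rbind(names_to = "snapshot")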

Citation

BibTeX citation:
@online{xinzhuo2023,
  author = {Xinzhuo, Huang},
  title = {Automatic {Web} {Scraper}},
  date = {2023-11-02},
  url = {https://xinzhuo.work/blog/github action R},
  langid = {en}
}
For attribution, please cite this work as:
Xinzhuo, Huang. 2023. “Automatic Web Scraper.” November 2, 2023. https://xinzhuo.work/blog/github action R.